Abstract

Challenging computer vision tasks, in particular semantic image segmentation,
require large training sets of annotated images. While obtaining the actual
images is often unproblematic, creating the necessary annotation is a tedious
and costly process. Therefore, one often has to work with unreliable annotation
sources, such as Amazon Mechanical Turk or (semi-)automatic algorithmic
techniques.

In this work, we present a Gaussian process (GP) based technique for
simultaneously identifying which images of a training set have unreliable
annotation and learning a segmentation model in which the negative effect of
these images is suppressed. Alternatively, the model can also just be used to
identify the most reliably annotated images from the training set, which can
then be used for training any other segmentation method.

By relying on "deep features" in combination with a linear covariance function,
our GP can be learned and its hyperparameters determined efficiently using only
matrix operations and gradient-based optimization. This makes our method
scalable even to large datasets with several million training instances.

1. Introduction 

The recent emergence of large image datasets has led to drastic progress in
computer vision. In order to achieve state-of-the-art performance for various
visual tasks, models are trained from millions of annotated images [?].
However, manually creating expert annotation for large datasets requires a
tremendous amount of resources and is often impractical, even with support by
major industrial Internet companies. For example, it has been estimated that
creating bounding box annotation for object detection tasks takes 25 seconds
per box [?], and several minutes of human effort per image can be required to
create pixel-wise annotation for semantic image segmentation tasks [?].

In order to facilitate the data annotation process and make it manageable,
researchers often utilize sources of annotation that are less reliable but that
scale more easily to large amounts of data. For example, one harvests images
from Internet search engines [?] or uses Amazon Mechanical Turk (MTurk) to
create annotation. Another approach is to create annotation in a
(semi-)automatic way, e.g. using knowledge transfer methods [?].

Figure 1: Learning with unreliable image annotations: our Gaussian process
based method jointly estimates a distribution over prediction models (blue
area: 95% confidence region, red line: most likely model), and a confidence
value for the annotation of each image in the training data.

A downside of such cheap data sources, in particular automatically created
annotations, is that they can contain a substantial amount of mistakes.
Moreover, these mistakes are often strongly correlated; for example, MTurk
workers will make similar annotation errors in all images they handle, and an
automatic tool, such as segmentation transfer, will work better on some classes
of images than others.

Using such noisily annotated data for training can lead to suboptimal
performance. As a consequence, many learning techniques try to identify and
suppress the wrong or unreliable annotations in the dataset before training.
However, this leads to a classical chicken-and-egg problem: one needs a good
data model to identify mislabeled parts of data and one needs reliable data to
estimate a good model.

Our contribution in this work is a Gaussian process (GP) [20] treatment of the
problem of learning with unreliable annotation. It avoids the above
chicken-and-egg problem by adopting a Bayesian approach, jointly learning a
distribution of suitable models and confidence values for each training image
(see Figure 1). Afterwards, we use the most likely such model to make
predictions. All (hyper-)parameters are learned from data, so no model
selection over free parameters, such as a regularization constant or noise
strength, is required.

We also describe an efficient and optionally distributed implementation of
Gaussian processes with low-rank covariance matrix that scales to segmentation
datasets with more than 100,000 images (16 million superpixel training
instances). Conducting experiments on the task of foreground/background image
segmentation with large training sets, we demonstrate that the proposed method
outperforms other approaches for identifying unreliably annotated images and
that this leads to improved segmentation quality.

1.1. Related work 

The problem of unreliable annotation has appeared in the 
literature previously in different contexts. 

For the task of dataset creation, it has become common practice to collect data
from unreliable sources, such as MTurk, but have each sample annotated by more
than one worker and combine the obtained labels, e.g. by a (weighted) majority
vote [?]. For segmentation tasks even this strategy can be too costly, and it
is not clear how annotations could be combined. Instead, it has been suggested
to have each image annotated only by a single worker, but require workers to
first fulfill a grading task [?]. When using images retrieved from search
engines, it has been suggested to make use of additional available information,
e.g. keywords, to filter out mislabeled images [24].

For learning a classifier from unreliable data, the easiest option is to ignore
the problem and rely on the fact that many discriminative learning techniques
are to some extent robust against label noise. We use this strategy as one of
the baselines for our experiments in Section 4, finding however that it leads
to suboptimal results. Alternatively, outlier filtering based on the
self-learning heuristic is popular: a prediction model is first trained on all
data, then its outputs are used to identify a subset of the data consistent
with the learned model. Afterwards, the model is retrained on the subset.
Optionally, these steps can be repeated multiple times [?]. We use this idea as
a second baseline for our experiments, showing that it improves the
performance, but not as much as the method we propose.

Special variants of popular classification methods, such as support vector
machines and logistic regression, have been proposed that are more tolerant to
label noise by explicitly modeling in the objective function the possibility of
label changes. However, these usually result in more difficult optimization
problems that need to be solved, and they can only be expected to work if
certain assumptions about the noise are fulfilled, in particular that the label
noise is statistically independent between different training instances. For an
in-depth discussion of these and more methods we recommend the recent survey on
learning with label noise [?].

Note that recently a method has been proposed by which a classifier is able to
self-assess the quality of its predictions [?]. While also based on Gaussian
processes, this work differs significantly from ours: it aims at evaluating
outputs of a learning system using a GP's posterior distribution, while in this
work our goal is to assess the quality of inputs for a learning system, and we
do so using the GP's ability to infer hyperparameters from data.

2. Learning with unreliable annotations 

We are given a training set, $\mathcal{D} = \{(I_j, M_j)\}_{j=1}^{n}$, that
consists of n pairs of images and segmentation masks. Each image $I_j$ is
represented as a collection of $r_j$ superpixels, $(x_1, \dots, x_{r_j})$, with
$x_k \in \mathcal{X}$ for each $k \in \{1, \dots, r_j\}$, where $\mathcal{X}$
is a universe of superpixels. Correspondingly, any segmentation mask $M_j$ is a
collection $(y_1, \dots, y_{r_j})$, where each $y_k \in \mathcal{Y}$ is the
semantic label of the superpixel $x_k$ and $\mathcal{Y}$ is a finite label set.
For convenience we combine all superpixels and semantic labels from the
training data and form vectors $\mathbf{X}$ and $\mathbf{y}$ of length N,
denoting individual superpixels and semantic labels by a lower index i. In the
scope of this work we consider the foreground-background segmentation problem
with $\mathcal{Y} = \{+1, -1\}$, where $+1$ stands for foreground and $-1$ for
background. An extension of our technique to the multiclass scenario is
possible, but beyond the scope of this manuscript.

The main goal of this work is to learn a prediction function,
$f : \mathcal{X} \to \mathcal{Y}$, in the presence of a significant number of
mistakes in the labels of the training data. We address this learning problem
using Gaussian processes.

2.1. Gaussian processes 

Gaussian processes (GPs) are a prominent Bayesian machine learning technique,
which in particular is able to reason about noise in data and allows
principled, gradient-based hyperparameter tuning. In this section we reiterate
key results from the Gaussian processes literature from a practitioner's view.
For a more complete discussion, see [20].

A GP is defined by a positive-definite covariance (or kernel) function,
$\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, that can depend on
hyperparameters $\theta$. For any test input, $\bar{x}$, the GP defines a
Gaussian posterior (or predictive) distribution,

$$p(\bar{y} \mid \bar{x}, \mathbf{X}, \mathbf{y}, \theta) = \mathcal{N}\big(m(\bar{x}), \sigma(\bar{x})\big). \tag{1}$$

The mean function,

$$m(\bar{x}) = \kappa(\bar{x})^\top K^{-1} \mathbf{y}, \tag{2}$$

allows us to make predictions (by taking its sign), and the variance,

$$\sigma(\bar{x}) = \kappa(\bar{x}, \bar{x}) - \kappa(\bar{x})^\top K^{-1} \kappa(\bar{x}), \tag{3}$$

reflects our confidence in this prediction, where K is the $N \times N$
covariance matrix of the training data with entries
$K_{ij} = \kappa(X_i, X_j)$ for $i, j \in \{1, \dots, N\}$ and
$\kappa(\bar{x}) = [\kappa(X_1, \bar{x}), \dots, \kappa(X_N, \bar{x})]^\top \in \mathbb{R}^N$.
Note that the mean function (2) is the same as one would obtain from kernel
ridge regression [?], which has proven effective also for classification
tasks [22].

Due to their probabilistic nature, Gaussian processes can incorporate
uncertainty about labels in the training set. One assumes that the label, y, of
any training example is perturbed by Gaussian noise with zero mean and variance
$\varepsilon^2$. Different noise variances for different examples reflect the
situation in which certain example labels are more trustworthy than others.

The specific form of the GP allows us to integrate out the label noise from the
posterior distribution. The integral can be computed in closed form, resulting
in a new posterior distribution with mean function,

$$m(\bar{x}) = \kappa(\bar{x})^\top K_\varepsilon^{-1} \mathbf{y}, \tag{4}$$

and variance
$\sigma(\bar{x}) = \kappa(\bar{x}, \bar{x}) - \kappa(\bar{x})^\top K_\varepsilon^{-1} \kappa(\bar{x})$,
for an augmented covariance matrix $K_\varepsilon = K + \mathcal{E}$, where
$\mathcal{E}$ is the diagonal matrix that contains the noise variances of all
training examples¹. We consider potential hyperparameters of $\mathcal{E}$ as a
part of $\theta$.

¹ Alternatively, we can think of $K_\varepsilon$ as the data covariance matrix
for a modified covariance function.

2.2. Hyperparameter learning

A major advantage of GPs over other regression techniques is that their
probabilistic interpretation offers a principled method for hyperparameter
tuning based on continuous, gradient-based optimization instead of
partitioning-based techniques such as cross-validation. We treat the unknown
hyperparameters as random variables and study the joint probability
$p(\mathbf{y}, \theta \mid \mathbf{X})$ over hyperparameters and semantic
labels. Employing type-II maximum likelihood estimation (see [20, chapter 5]),
we obtain optimal hyperparameters, $\theta^*$, by solving the following
optimization problem,

$$\theta^* = \operatorname{argmax}_{\theta} \, \ln p(\mathbf{y} \mid \theta, \mathbf{X}). \tag{5}$$

The expression $p(\mathbf{y} \mid \theta, \mathbf{X})$ in the objective (5) is
known as the marginal likelihood. Its value and gradient can be computed in
closed form,

$$\ln p(\mathbf{y} \mid \theta, \mathbf{X}) = -\frac{1}{2}\left(\mathbf{y}^\top K_\varepsilon^{-1} \mathbf{y} + \ln |K_\varepsilon| + N \ln(2\pi)\right), \tag{6}$$

$$\frac{\partial \ln p(\mathbf{y} \mid \theta, \mathbf{X})}{\partial \theta_j} = \frac{1}{2} \operatorname{tr}\left(\left(\alpha \alpha^\top - K_\varepsilon^{-1}\right) \frac{\partial K_\varepsilon}{\partial \theta_j}\right), \tag{7}$$

where $\alpha = K_\varepsilon^{-1} \mathbf{y}$, $\theta_j$ is any entry of
$\theta$, $\partial K_\varepsilon / \partial \theta_j$ is an element-wise
partial derivative and $|K_\varepsilon|$ denotes the determinant of
$K_\varepsilon$. If the entries of $K_\varepsilon$ depend smoothly on $\theta$
then the maximization problem (5) is also smooth and one can apply standard
gradient-based techniques, even for high-dimensional $\theta$ (i.e. many
hyperparameters). While the solution is not guaranteed to be globally optimal,
since the objective (5) is not convex, the procedure has been observed to
result in good estimates which are largely insensitive to the
initialization [20].
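As a rough illustration of Eqs. (4), (6) and (7), the following NumPy sketch evaluates the noisy GP posterior, the log marginal likelihood and its gradient with respect to per-example noise levels on a small dense problem. The function names are hypothetical, and parameterizing the noise by standard deviations eps (so that the diagonal of the noise matrix holds eps**2) is an assumption made for this example. It uses dense O(N^3) linear algebra and is not the scalable implementation discussed in Section 3.

```python
import numpy as np


def _solve_spd(K_eps, b):
    """Solve K_eps x = b for a symmetric positive-definite K_eps."""
    L = np.linalg.cholesky(K_eps)
    return np.linalg.solve(L.T, np.linalg.solve(L, b))


def gp_posterior(K, k_star, k_star_star, y, eps):
    """Posterior mean (Eq. 4) and variance at one test point.

    K: (N, N) training covariance, k_star: (N,) covariances between the
    training points and the test point, k_star_star: prior variance at the
    test point, eps: (N,) per-example noise standard deviations (assumed).
    """
    K_eps = K + np.diag(eps ** 2)
    mean = k_star @ _solve_spd(K_eps, y)
    var = k_star_star - k_star @ _solve_spd(K_eps, k_star)
    return mean, var


def log_marginal_likelihood(K, y, eps):
    """Log marginal likelihood, Eq. (6)."""
    K_eps = K + np.diag(eps ** 2)
    alpha = _solve_spd(K_eps, y)
    _, log_det = np.linalg.slogdet(K_eps)
    return -0.5 * (y @ alpha + log_det + y.size * np.log(2.0 * np.pi))


def grad_log_marginal_likelihood_eps(K, y, eps):
    """Gradient of Eq. (6) w.r.t. each eps_i, obtained from Eq. (7).

    Since dK_eps / d eps_i = 2 * eps_i * e_i e_i^T, the trace in Eq. (7)
    reduces to (alpha_i^2 - [K_eps^{-1}]_ii) * eps_i.
    """
    K_inv = np.linalg.inv(K + np.diag(eps ** 2))
    alpha = K_inv @ y
    return (alpha ** 2 - np.diag(K_inv)) * eps
```

A gradient-based optimizer such as L-BFGS could then maximize `log_marginal_likelihood` over `eps` using this gradient; that wiring is omitted here.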


2.3. A Gaussian process with groupwise confidences 

Our main contribution in this work is a new approach, GP-GC, for handling
unreliably annotated data in which some training examples are more trustworthy
than others. Earlier GP-based approaches either assume that the noise variance
is constant for all training examples, i.e. $\mathcal{E} = \lambda I$ for some
$\lambda > 0$, or that the noise variance is a smooth function of the inputs,
$\mathcal{E} = \operatorname{diag}(g(X_1), \dots, g(X_N))$, where g is also a
Gaussian process function [?]. Neither approach is suitable for our situation:
constant noise variance makes it impossible to distinguish between more and
less reliable annotations. Input-dependent noise variance can reflect only
errors due to image contents, which is not adequate for errors due to an
unreliable annotation process. For example, in image segmentation even
identically looking superpixels need not share the same noise level if they
originate from different images or were annotated by different MTurk workers.

The above insight suggests allowing arbitrary noise levels,
$\mathcal{E} = \operatorname{diag}(\varepsilon_1^2, \dots, \varepsilon_N^2)$,
for all training instances. However, without additional constraints this would
give too much freedom in modelling the data and lead to overfitting. Therefore,
we propose to take an intermediate route, based on the idea of estimating label
confidence in groups. In particular, for image segmentation problems it is
sufficient to model confidences for the entire image segmentation masks,
instead of confidences for every individual superpixel. We obtain such
per-image confidence scores by assuming that all superpixel labels from the
same image share the same confidence value, i.e.
$\varepsilon_i = \varepsilon_j$ if $X_i$ and $X_j$ belong to the same image. We
treat the unknown noise levels as hyperparameters and learn their values in the
way described above. Since our confidence about labels is based on the learned
noise variances, we also refer to the above procedure as "learning label
confidence". We call the resulting algorithm Gaussian Process with Groupwise
Confidences, or GP-GC.

Note that we avoid the chicken-and-egg problem mentioned in the introduction
because we simultaneously obtain hyperparameters $\theta$, in particular the
noise levels $\varepsilon = [\varepsilon_1, \dots, \varepsilon_N]$, and the
predictive distribution.
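The groupwise tying of noise levels can be sketched with plain index arithmetic. The snippet below is a minimal illustration, assuming a hypothetical integer array `image_of_superpixel` that maps each of the N superpixels to the index of its image: one noise level per image is expanded to all of its superpixels, and a per-superpixel gradient is folded back to one value per image by the chain rule (a sum within each group).

```python
import numpy as np


def expand_group_noise(eps_per_image, image_of_superpixel):
    """Give every superpixel the noise level of the image it belongs to."""
    return np.asarray(eps_per_image)[image_of_superpixel]


def group_gradient(grad_per_superpixel, image_of_superpixel, n_images):
    """Sum per-superpixel gradients within each image.

    Because all superpixels of an image share one hyperparameter, the
    derivative w.r.t. that shared value is the sum of the derivatives
    w.r.t. the individual per-superpixel noise levels.
    """
    return np.bincount(image_of_superpixel,
                       weights=grad_per_superpixel,
                       minlength=n_images)
```

With this tying the number of noise hyperparameters equals the number of images rather than the number of superpixels.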

2.4. Instance reweighting 

For unbalanced datasets, e.g. in the image segmentation case, where the
background class is more frequent than the foreground, it makes sense to
balance the data before training. A possible mechanism for this is to duplicate
training instances of the minority class. Done naively, however, this would
unnecessarily increase the computational complexity. Instead, we propose a
computational shortcut that allows incorporating duplicate instances without
overhead. Let $w \in \mathbb{N}^N$ be a vector of duplicate counts, i.e. $w_i$
is the number of copies of the training instance $X_i$. Elementary
transformations reveal that for the mean function (4), a duplication of
training instances is equivalent to changing each hyperparameter
$\varepsilon_i$ to $\varepsilon_i / \sqrt{w_i}$. We denote by $\theta_w$ the
vector of hyperparameters in which $\varepsilon$ is divided elementwise by the
square root of the entries of w. To incorporate duplicates into the marginal
likelihood we also need to rescale $\varepsilon$ in this way. In addition, we
need to add some terms to the marginal likelihood, resulting in the following
reweighted marginal likelihood,

$$\ln p_w(\mathbf{y} \mid \theta) \doteq \ln p(\mathbf{y} \mid \theta_w) + \frac{1}{2} \sum_{i=1}^{N} \left[\ln \frac{\varepsilon_i^2}{w_i} - w_i \ln \varepsilon_i^2\right], \tag{8}$$

where "$\doteq$" means equality up to a constant that does not depend on the
hyperparameters.

Note that the above expressions are well-defined also for non-integer weights,
w, which gives us not only the possibility to increase the importance of
samples, but also to decrease it, if required.
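The duplication shortcut can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, eps holds noise standard deviations, and the correction term follows Eq. (8) with the hyperparameter-independent constant dropped. For integer weights the result can be compared against a GP on the explicitly duplicated data, where the dropped constant is 0.5 * (N - sum(w)) * ln(2*pi).

```python
import numpy as np


def log_ml(K, y, noise_var):
    """Dense log marginal likelihood for covariance K + diag(noise_var)."""
    K_eps = K + np.diag(noise_var)
    _, log_det = np.linalg.slogdet(K_eps)
    return -0.5 * (y @ np.linalg.solve(K_eps, y) + log_det
                   + y.size * np.log(2.0 * np.pi))


def posterior_mean(K, k_star, y, noise_var):
    """Posterior mean at a test point with training covariances k_star."""
    return k_star @ np.linalg.solve(K + np.diag(noise_var), y)


def reweighted_log_ml(K, y, eps, w):
    """Right-hand side of Eq. (8), without the additive constant.

    eps: (N,) noise standard deviations, w: (N,) positive weights
    (integer duplicate counts or, more generally, real-valued weights).
    """
    eps_w = eps / np.sqrt(w)                  # rescaled noise levels
    correction = 0.5 * np.sum(np.log(eps ** 2 / w) - w * np.log(eps ** 2))
    return log_ml(K, y, eps_w ** 2) + correction
```

No duplicated rows are ever materialized, so the cost stays that of a problem with N instances.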

3. Efficient Implementation 

Gaussian processes have a reputation for being computationally demanding.
Generally, their computational complexity scales cubically with the number of
training instances and their memory consumption grows quadratically, because
they have to store and invert the augmented data covariance matrix,
$K_\varepsilon$. Thus, standard implementations of Gaussian processes become
computationally prohibitive for large-scale datasets.

Nevertheless, if the sample covariance matrix has a low-rank structure, all
necessary computations can be carried out much faster by utilizing the
Morrison-Sherman-Woodbury identity and the matrix determinant lemma
[?, Corollary 4.3.1]. To benefit from this, many techniques for approximating
GPs by low-rank GPs have been developed using, e.g., the Nyström
decomposition [32], random subsampling [?], k-means clustering [?], approximate
kernel feature maps [19, 29], or inducing points [?].

In this work we follow the general trend in computer vision and rely on an
explicit feature map (obtained from a pre-trained deep network [5, 25]) in
combination with a linear covariance function. This allows us to develop a
parallel and distributed implementation of Gaussian processes with exact
inference, even in the large-scale regime. Formally, we use a linear covariance
function, $\kappa$, of the following form,

$$\kappa(x_1, x_2) = \phi(x_1)^\top \Sigma \, \phi(x_2), \tag{9}$$

where $\phi : \mathcal{X} \to \mathbb{R}^k$ is a k-dimensional feature map with
$k \ll N$, and
$\Sigma = \operatorname{diag}(\sigma_1^2, \dots, \sigma_k^2) \in \mathbb{R}^{k \times k}$
is a diagonal matrix of feature scales. The entries of $\Sigma$ are assumed to
be unknown and included in the vector of hyperparameters $\theta$. The feature
map $\phi$ induces a feature matrix
$F = [\phi(X_1), \dots, \phi(X_N)] \in \mathbb{R}^{k \times N}$ of the training
set. As a result, the augmented covariance matrix has a special structure as
the sum of a diagonal and a low-rank matrix,

$$K_\varepsilon = \mathcal{E} + F^\top \Sigma F. \tag{10}$$

This low-rank representation allows us to store $K_\varepsilon$ implicitly by
storing the matrices $\mathcal{E}$, $\Sigma$ and F, which reduces the memory
requirements from $O(N^2)$ to $O(Nk)$. Moreover, all necessary computations for
the predictive distribution (4) and marginal likelihood (6) require only
$O(Nk^2)$ operations instead of $O(N^3)$ [?].
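To make the complexity claim concrete, here is a minimal NumPy sketch of how the Woodbury identity and the matrix determinant lemma give $K_\varepsilon^{-1}\mathbf{y}$ and $\ln|K_\varepsilon|$ for the structure in Eq. (10) without ever forming the N x N matrix. The function name and argument layout (F of shape k x N, vectors of noise and feature variances) are assumptions for this example, not the authors' code.

```python
import numpy as np


def lowrank_solve_and_logdet(F, noise_var, feat_var, y):
    """Return alpha = K^{-1} y and ln|K| for K = diag(noise_var) + F^T diag(feat_var) F.

    F: (k, N) feature matrix, noise_var: (N,) diagonal of the noise matrix,
    feat_var: (k,) diagonal of Sigma. Cost is O(N k^2) time, O(N k) memory.
    """
    Fe = F / noise_var                          # F E^{-1}: column j scaled by 1/noise_var[j]
    C = np.diag(1.0 / feat_var) + Fe @ F.T      # k x k matrix Sigma^{-1} + F E^{-1} F^T
    L = np.linalg.cholesky(C)

    # Woodbury: K^{-1} y = E^{-1} y - E^{-1} F^T C^{-1} F E^{-1} y
    z = np.linalg.solve(C, Fe @ y)
    alpha = y / noise_var - Fe.T @ z

    # Determinant lemma: |K| = |C| * |Sigma| * |E|
    log_det = (2.0 * np.sum(np.log(np.diag(L)))
               + np.sum(np.log(feat_var))
               + np.sum(np.log(noise_var)))
    return alpha, log_det
```

Only the k x k matrix C is factorized, so the expensive step scales with the feature dimension rather than with the number of training instances.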

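Computing the gradients (7) with respect to unknown hyperparameters generally
imposes a computational overhead that scales linearly with the number of
hyperparameters [?]. For GP-GC, however, we can exploit the homogeneous
structure of the hyperparameters, $\theta = [\varepsilon, \sigma]$, where
$\varepsilon = [\varepsilon_1, \dots, \varepsilon_N]$ and
$\sigma = [\sigma_1, \dots, \sigma_k]$, for deriving an expression for the
gradient without such overhead:

$$\nabla_\varepsilon \ln p(\mathbf{y} \mid \theta) = \operatorname{diag}\left(\left(\alpha \alpha^\top - K_\varepsilon^{-1}\right) \mathcal{E}'\right), \tag{11}$$

$$\nabla_\sigma \ln p(\mathbf{y} \mid \theta) = \operatorname{diag}\left(F \left(\alpha \alpha^\top - K_\varepsilon^{-1}\right) F^\top \Sigma'\right), \tag{12}$$

where $\alpha = K_\varepsilon^{-1} \mathbf{y}$ and $\mathcal{E}'$ and
$\Sigma'$ are diagonal matrices formed by the vectors $\varepsilon$ and
$\sigma$ respectively.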
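One possible way to evaluate the gradients (11) and (12) in O(N k^2) time is sketched below. It is an illustration only: the function names are hypothetical, eps and sigma are taken to be standard deviations whose squares fill the two diagonal matrices, and the diagonals of the inverse are obtained through the Woodbury identity rather than by forming any N x N matrix. A dense reference likelihood is included solely so that the gradients can be checked by finite differences.

```python
import numpy as np


def lowrank_gradients(F, eps, sigma, y):
    """Gradients of ln p(y | theta) w.r.t. eps (N,) and sigma (k,).

    Uses K = diag(eps^2) + F^T diag(sigma^2) F with F of shape (k, N).
    """
    e, s = eps ** 2, sigma ** 2
    Fe = F / e                                   # F E^{-1}, shape (k, N)
    G = Fe @ F.T                                 # F E^{-1} F^T, shape (k, k)
    C = np.diag(1.0 / s) + G
    Ci_Fe = np.linalg.solve(C, Fe)               # C^{-1} F E^{-1}, shape (k, N)

    alpha = y / e - Fe.T @ (Ci_Fe @ y)           # K^{-1} y via Woodbury
    diag_K_inv = 1.0 / e - np.sum(Fe * Ci_Fe, axis=0)
    grad_eps = (alpha ** 2 - diag_K_inv) * eps   # Eq. (11)

    F_Kinv_Ft = G - G @ np.linalg.solve(C, G)    # F K^{-1} F^T
    grad_sigma = ((F @ alpha) ** 2 - np.diag(F_Kinv_Ft)) * sigma   # Eq. (12)
    return grad_eps, grad_sigma


def dense_log_ml(F, eps, sigma, y):
    """Dense O(N^3) reference for Eq. (6), used only for checking."""
    K = np.diag(eps ** 2) + F.T @ (sigma[:, None] ** 2 * F)
    _, log_det = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + log_det
                   + y.size * np.log(2.0 * np.pi))
```

In GP-GC the entries of eps and sigma are additionally tied within groups, in which case the per-entry gradients would be summed within each group.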

The computational bottleneck of low-rank Gaussian process learning is
constituted by standard linear algebra routines, in particular matrix
multiplication and inversion. Thus, a significant reduction in runtime can be
achieved by relying on multi-threaded linear algebra libraries or even GPUs.

3.1. Distributed implementation 

Despite great improvement in performance by utilizing a low-rank covariance
function and parallel matrix operations, Gaussian processes still remain
computationally challenging for truly large datasets with high-dimensional
feature maps. For example, one of the datasets we use in our experiments has
more than 100,000 images, 16 million superpixels and a 4,113-dimensional
feature representation. Storing the feature matrix alone requires more than
512 GB RAM, which is typically not available on a single workstation, but
easily achievable if the representation is distributed across multiple
machines.

(b) AutoSeg dataset with automatic bounding boxes (depicted in red): horses
(top left), dogs (top right), cats (bottom left) and sheep (bottom right).

Figure 2: Examples of training images and their segmentation masks (marked in
purple) for the two datasets used. The horizontal bars reflect the quality
value GP-GC assigns to the segmentation masks; the length of the bright green
stripe is proportional to the number of images in the corresponding dataset
that are estimated to have lower confidence than the depicted image.

In order to overcome memory limitations and even further improve the
computational performance we developed a distributed version of low-rank
Gaussian processes. It relies on the insight that the feature matrix F itself
is not required for computing the prediction function (4), the marginal
likelihood (6) and its gradient (7), if an oracle is available for answering
the following four queries:

i. compute $Fv$ for any $v \in \mathbb{R}^{N}$,

ii. compute $F^\top u$ for any $u \in \mathbb{R}^{k}$,

iii. compute $F D F^\top$ for any diagonal $D \in \mathbb{R}^{N \times N}$,

iv. compute $\operatorname{diag}(F^\top A F)$ for any $A \in \mathbb{R}^{k \times k}$.


See Appendix A for a detailed explanation. On top of such an oracle we need
only $O(k^2 + N)$ bytes of memory and $O(k^3 + N)$ operations to accomplish all
necessary computations, which is orders of magnitude less than the original
requirements of $O(Nk)$ bytes and $O(Nk^2)$ operations.
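The four oracle queries can be sketched for a feature matrix that is stored as a list of column blocks, which is one natural way to hold it across several workers. Everything below is a single-process illustration with hypothetical function names: the Python list stands in for the workers, and each query touches one block at a time, combining partial results by a sum or a concatenation.

```python
import numpy as np


def split_like_blocks(x, blocks):
    """Split a length-N vector into parts matching the blocks' column counts."""
    sizes = [B.shape[1] for B in blocks]
    return np.split(x, np.cumsum(sizes)[:-1])


def query_F_v(blocks, v):
    """(i) F v, as a sum of per-block products."""
    return sum(B @ v_i for B, v_i in zip(blocks, split_like_blocks(v, blocks)))


def query_Ft_u(blocks, u):
    """(ii) F^T u, as a concatenation of per-block products."""
    return np.concatenate([B.T @ u for B in blocks])


def query_F_D_Ft(blocks, d):
    """(iii) F D F^T for D = diag(d), as a sum of k x k matrices."""
    return sum((B * d_i) @ B.T
               for B, d_i in zip(blocks, split_like_blocks(d, blocks)))


def query_diag_Ft_A_F(blocks, A):
    """(iv) diag(F^T A F), computed column by column within each block."""
    return np.concatenate([np.sum(B * (A @ B), axis=0) for B in blocks])
```

Each per-block result has size at most k x k or the block's column count, so only small objects would need to travel between machines in a distributed setting.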

Implementing a distributed version of the oracle is straightforward: suppose
that p computational nodes are available. We then split the feature matrix
$F = [F_1, F_2, \dots, F_p]$ into p roughly equally-sized parts. Each part is
stored on one of the nodes in a distributed way. All oracle operations
naturally decompose with respect to the parts of the feature matrix:

i. $Fv = \sum_{i=1}^{p} F_i v_i$,

ii. $F^\top u = [(F_1^\top u)^\top, \dots, (F_p^\top u)^\top]^\top$,

iii. $F D F^\top = \sum_{i=1}^{p} F_i D_i F_i^\top$,

iv. $\operatorname{diag}(F^\top A F) = [\operatorname{diag}(F_1^\top A F_1)^\top, \dots, \operatorname{diag}(F_p^\top A F_p)^\top]^\top$,

where we split the vector v and the diagonal matrix D into p parts in the same
fashion as we split F, obtaining $v_i$ and $D_i$ for all
$i \in \{1, \dots, p\}$. A master node takes care of distributing the objects
v, u, D and A over computational nodes. Each computational node i calculates
$F_i v_i$, $(F_i^\top u)^\top$, $F_i D_i F_i^\top$ and
$\operatorname{diag}(F_i^\top A F_i)^\top$ and sends the results to the master
node, which collects the results of every operation and aggregates them by
taking the sum for operations (i) and (iii) or the concatenation for operations
(ii) and (iv). The communication between the master node and computational
nodes requires sending messages of size at most $O(k^2 + N)$ bytes, which is
small in relation to the size of the training data.

Consequently, our distributed implementation reduces the time and per-machine
memory requirements by a factor of p, at the expense of minor overhead for
network communication and computations on the master node.

Method               SVM     GP      GP-GC
HDSeg dataset
  Horses (19,060)    82.5    82.5    83.7
  Dogs (111,668)     80.6    80.5    81.3
AutoSeg dataset
  Horses (9,007)     81.2    80.3    82.5
  Dogs (41,777)      ?       77.1    79.4
  Cats (3,006)       73.1    72.4    73.5
  Sheep (5,079)      75.6    75.4    80.0

Table 1: Numerical results (per-class average accuracy in %) of GP-GC and
baseline approaches. The numbers in brackets indicate the number of images in
the training sets. The best numbers are in bold font, see text for details on
statistical significance.

4. Experiments 

We implemented GP-GC in Python, relying on the OpenBLAS library² for linear
algebra operations and L-BFGS [2] for gradient-based optimization. The code
will be made publicly available.

We perform experiments on two large-scale datasets for foreground-background
image segmentation, see Figure 2 for example images.

1) HDSeg [13]³. We use the 19,060 images of horses and 111,668 images of dogs
with segmentation masks created automatically by the segmentation transfer
method [?] for training. The test images are 241 and 306 manually segmented
images of horses and dogs, respectively.

2) AutoSeg, a new dataset that we collated from public sources and augmented
with additional annotation⁴. The training images for this dataset are taken
from the ImageNet project⁵. There are four categories: horses (9,007 images),
dogs (41,777 images), cats (3,006 images), sheep (5,079 images). All training
images are annotated with segmentation masks generated automatically by the
GrabCut algorithm from the OpenCV library⁶ with default parameters. We
initialize GrabCut with bounding boxes that were also generated automatically
by the ImageNet Auto-Annotation method [30]⁷. The test set consists of 1001
images of horses, 1521 images of dogs, 1480 images of cats and 489 images of
sheep with manually created per-pixel segmentation masks that were taken from
the validation part of the MS COCO⁸ dataset.

² http://openblas.net
³ http://ist.ac.at/~akolesnikov/HDSeg/
⁴ We will publish the dataset, including pre-computed features.
⁵ http://www.image-net.org

Prediction model            GP                          SVM
Selection rule       SVM margin  GP-GC confidence  SVM margin  GP-GC confidence
HDSeg dataset
  Horses                83.5         84.3             83.8         84.3
  Dogs                  81.2         81.7             81.7         82.0
AutoSeg dataset
  Horses                82.7         83.9             82.5         83.2
  Dogs                  79.7         81.2             79.2         80.9
  Cats                  72.5         73.9             71.9         72.9
  Sheep                 81.2         82.9             80.2         81.7

Table 2: Numerical results (per-class average accuracy in %) of training an SVM
or GP on filtered data. An SVM or GP classifier is trained on 25% of the most
reliable images from the dataset, as selected by the SVM margin or GP-GC
filtering. The best numbers are in bold font, see text for details on
statistical significance.

As evaluation metric for both datasets we use the average class accuracy [?]:
we compute the percentage of correctly classified foreground pixels and the
percentage of correctly classified background pixels across all images and
average both values. To assess the significance of reported results, the above
single number is not sufficient. Therefore, we use a closely related quantity
for this purpose: we compute an average class accuracy as above separately for
every image and perform a Wilcoxon signed-rank test [31] with significance
level $10^{-?}$.
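The evaluation metric described above can be written down in a few lines. This sketch assumes predictions and ground truth are given as flat arrays of +1 (foreground) and -1 (background) pixel labels pooled over all test images, and that both classes occur at least once; the function name is hypothetical.

```python
import numpy as np


def average_class_accuracy(pred, gt):
    """Mean of foreground and background pixel accuracy, in percent.

    pred, gt: arrays of labels in {+1, -1}; both classes must be present
    in gt, otherwise one of the two accuracies is undefined.
    """
    pred = np.asarray(pred)
    gt = np.asarray(gt)
    fg_acc = np.mean(pred[gt == 1] == 1)     # fraction of foreground pixels found
    bg_acc = np.mean(pred[gt == -1] == -1)   # fraction of background pixels found
    return 100.0 * 0.5 * (fg_acc + bg_acc)
```

Applying the same function to the pixels of a single image gives the per-image values used as paired samples in the significance test.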

4.1. Image Representation 

We split every image into superpixels using the SLIC [?] method from the
scikit-image⁹ library. Each superpixel is assigned a semantic label based on
the majority vote of pixel labels inside it. For each superpixel we compute
appearance-based features using the OverFeat [25] library¹⁰. We extract a
4096-dimensional vector from the output of the 20th layer of the pretrained
model referred to as fast model in the library documentation. Additionally, we
add features that describe the position of a superpixel in its image. For this
we split each image into a 4x4 uniform grid and describe the position of each
superpixel by 16 values, each one specifying the ratio of pixels from the
superpixel falling into the corresponding grid cell. We also add a constant
(bias) feature, resulting in an overall feature map,
$\phi : \mathcal{X} \to \mathbb{R}^k$, with k = 4113. The features within each
of the three homogeneous groups (appearance, position, constant) share the same
scale hyperparameter in the covariance function (9), i.e.
$\sigma_i = \sigma_j$ if the feature dimensions i and j are within the same
group.

⁶ http://opencv.org
⁷ http://groups.inf.ed.ac.uk/calvin/proj-imagenet/page
⁸ http://mscoco.org/
⁹ http://scikit-image.org
¹⁰ http://cilvr.nyu.edu/doku.php?id=software:overfeat:start

Method            GP-GC  Top-1%  Top-2%  Top-5%  Top-10%  Top-15%  Top-25%  Top-50%  Top-75%
HDSeg dataset
  Horses          83.7   82.7    83.0    83.7    83.9     84.1     84.3     84.4     83.9
  Dogs            81.3   80.0    80.3    80.8    81.2     81.4     81.7     81.9     81.6
AutoSeg dataset
  Horses          82.5   82.4    82.8    83.3    83.8     83.8     83.9     83.4     83.6
  Dogs            79.4   77.1    77.7    78.9    80.1     80.7     81.2     80.8     79.9
  Cats            73.5   57.0    69.3    72.3    72.6     73.3     73.9     74.4     74.3
  Sheep           80.0   78.1    80.4    81.6    82.6     82.5     82.9     81.4     79.4

Table 3: Numerical results (per-class average accuracy in %) of training a GP
model from different subsets of training data. Column "Top-γ%" indicates that
we select γ% of the most reliable images according to confidences learned by
GP-GC. The best numbers are in bold font.

4.2. Baseline approaches 

We compare GP-GC against two baselines. The first baseline is also a Gaussian
process, but we assume that all superpixels have the same noise variance. All
hyperparameters are again learned by type-II maximum likelihood. We refer to
this method simply as GP. This baseline is meant to study if a selective
estimation of the confidence values indeed has a positive effect on prediction
performance.

As a second baseline we use a linear support vector machine (SVM), relying on
the LibLinear implementation with squared slack variables, which is known to
deliver state-of-the-art optimization speed and prediction quality [?]. For
training SVM models we always perform 5-fold cross-validation to determine the
regularization constant $C \in \{2^{-20}, 2^{-19}, \dots, 2^{-1}\}$.

4.3. Foreground-background Segmentation 

We conduct experiments on the HDSeg and AutoSeg datasets, analyzing the
potential of GP-GC for two tasks: either as a dedicated method for semantic
image segmentation, or as a tool for identifying reliably annotated images,
which can be used afterwards, e.g., as a training set for other approaches. For
all experiments we reweight training data so that foreground and background
classes are balanced and all instances with the same semantic label have the
same weight, but the overall weight remains unchanged, i.e.
$\sum_{i=1}^{N} w_i = N$. This step removes the effect of different ratios of
foreground and background labels for different datasets and their subsets.
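A minimal sketch of such balancing weights is given below, assuming labels in {+1, -1} with both classes present. Giving each instance of class c the weight N / (2 N_c), where N_c is the class size, is one choice that satisfies the three stated conditions (equal total weight per class, equal weight within a class, total weight N); the function name is hypothetical.

```python
import numpy as np


def balancing_weights(y):
    """Per-instance weights that balance the two classes and sum to N."""
    y = np.asarray(y)
    n_total = y.size
    w = np.empty(n_total, dtype=float)
    for label in (1, -1):
        mask = y == label
        # every instance of this class gets an equal share of N / 2
        w[mask] = n_total / (2.0 * mask.sum())
    return w
```

These weights are in general non-integer, which the reweighted marginal likelihood of Section 2.4 permits.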

The first set of experiments compares GP-GC with the baselines, GP and SVM, on
the task of foreground-background segmentation. Numeric results are presented
in Table 1. They show that GP-GC achieves the best results for all datasets and
all semantic classes. According to a Wilcoxon signed-rank test, GP-GC's
improvement over the baselines is significant at the $10^{-?}$ level in all
cases.

We obtain two insights from this. First, the fact that GP-GC improves over GP
confirms that it is indeed beneficial to learn different confidence
hyperparameters for different images. Second, the results also confirm that
classification using Gaussian process regression with gradient-based
hyperparameter selection yields results comparable with other state-of-the-art
classifiers, such as SVMs, whose regularization parameter has to be chosen by
more tedious cross-validation procedures.

In a second set of experiments we benchmark GP-GC's ability to suppress images
with unreliable annotation. For this, we apply GP-GC to the complete training
set and use the learned hyperparameter values (see Figure 2 for an
illustration) to form a new data set that consists only of the 25% of images
that GP-GC was most confident about. We compare this approach to SVM-based
filtering similar to what has been done in the computer vision literature
before [?]: we train an SVM on the original dataset and form per-image
confidence values by averaging the SVM margins of the contained superpixels.
Afterwards we use the same construction as above, forming a new data set from
the 25% of images with highest confidence scores.
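Both selection rules reduce to ranking images by a scalar score. The sketch below makes two labeled assumptions: for GP-GC, "most confident" is read as "smallest learned noise level", and for the SVM baseline the margin of a superpixel is taken to be its label times its decision value. The function names and the array `image_of_superpixel` (image index of each superpixel) are hypothetical.

```python
import numpy as np


def select_most_confident(eps_per_image, fraction=0.25):
    """Indices of the images with the smallest learned noise levels."""
    eps_per_image = np.asarray(eps_per_image)
    n_keep = max(1, int(round(fraction * eps_per_image.size)))
    return np.argsort(eps_per_image, kind="stable")[:n_keep]


def mean_margin_per_image(scores, labels, image_of_superpixel, n_images):
    """Average signed margin (label * decision value) of each image's superpixels."""
    margins = np.asarray(labels) * np.asarray(scores)
    sums = np.bincount(image_of_superpixel, weights=margins, minlength=n_images)
    counts = np.bincount(image_of_superpixel, minlength=n_images)
    return sums / np.maximum(counts, 1)      # images without superpixels get 0
```

For the SVM rule, the same top-fraction selection would be applied to the negated mean margins, so that the images with the largest average margin are kept.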

We benchmark how useful the resulting data sets are by using them as training
sets for either a GP (with single noise variance) or an SVM. Table 2 shows the
results. By comparing the results to Table 1 one sees that both methods for
filtering out images with unreliable annotation help the segmentation accuracy.
However, the improvement from filtering using GP-GC is higher than when using
the data filtered by the SVM approach, regardless of the classifier used
afterwards. This indicates that GP-GC is a more reliable method for suppressing
bad annotation. According to a Wilcoxon test, GP-GC's improvement over the
other method is significant at the $10^{-?}$ level in 11 out of 12 cases (all
except AutoSeg sheep for the SVM classifier).

To understand this effect in more detail, we performed
another experiment: we used GP-GC to create training sets
of different sizes (1% to 75% of the original training sets) 
and trained the GP model on each of them. The results 
in Table show that the best results are consistently ob¬ 
tained when using 25%-50% of the data. For example, 
for the largest dataset (HDSeg dog), the quality of the pre¬ 
diction model keeps increasing up to a training set of over 
55,000 images (8 million superpixels). This shows that hav¬ 
ing many training images (even with unreliable annotations) 
is beneficial for the overall performance and that scalability 
is an important feature of our approach. 

5. Summary 

In this work, we presented GP-GC, an efficient and
parameter-free method for learning from datasets with
unreliable annotation, in particular for image segmentation
tasks. The main idea is to use a Gaussian process to jointly
model the prediction function and the confidence scores of the
individual annotations in the training data. The confidence
values are shared within groups of examples, e.g. all
superpixels within an image, and can be obtained automatically
using Bayesian reasoning and gradient-based hyperparameter
tuning. As a consequence, there are no free parameters
that need to be tuned.

In experiments on two large-scale image segmentation 
datasets, we showed that by learning individual confidence 
values GP-GC is able to better cope with unreliable anno¬ 
tation than other classification methods. Furthermore, we 
showed that the estimated confidences allow us to filter out 
examples with unreliable annotation, thereby providing a 
way to create a cleaner dataset that can afterwards also be
used by other learning methods.

By relying on an explicit feature map and low-rank ker¬ 
nel, GP-GC training is very efficient and easily imple¬ 
mented in a parallel or even distributed way. For example, 
training with 20 machines on the HDSeg dog segmentation 
dataset, which consists of over 100,000 images (16 million 
superpixels), takes only a few hours. 


A. Reduction to the oracle 


We demonstrate that having the oracle from Section 3.1
is sufficient to compute the mean function, the marginal
likelihood, and its gradient without access to the feature
matrix F itself. We mark terms that the oracle can
compute with braces labeled by the number of the corresponding
oracle operation.

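We first apply the Sherman-Morrison-Woodbury identity
and the matrix determinant lemma to the matrix $K_\varepsilon$:

$$K_\varepsilon^{-1} = (\mathcal{E} + F^\top \Sigma F)^{-1} = \mathcal{E}^{-1} - \mathcal{E}^{-1} F^\top C^{-1} F \mathcal{E}^{-1}, \tag{13}$$

$$\ln|K_\varepsilon| = \ln|\mathcal{E} + F^\top \Sigma F| = \ln|\mathcal{E}| + \ln|\Sigma| + \ln|C|, \tag{14}$$

where we denote $C = \Sigma^{-1} + \underbrace{F \mathcal{E}^{-1} F^\top}_{(iii)}$.

For convenience we introduce $\tilde{y} = \mathcal{E}^{-1} y$. Relying on
(13) we compute the following expressions:

$$y^\top K_\varepsilon^{-1} y = y^\top \tilde{y} - \underbrace{(F\tilde{y})^\top}_{(i)} C^{-1} \underbrace{(F\tilde{y})}_{(i)}, \tag{15}$$

$$F K_\varepsilon^{-1} y = \underbrace{F\tilde{y}}_{(i)} - \underbrace{(F \mathcal{E}^{-1} F^\top)}_{(iii)} C^{-1} \underbrace{(F\tilde{y})}_{(i)}, \tag{16}$$

$$F K_\varepsilon^{-1} F^\top = \underbrace{(F \mathcal{E}^{-1} F^\top)}_{(iii)} \big(I - C^{-1} \underbrace{(F \mathcal{E}^{-1} F^\top)}_{(iii)}\big), \tag{17}$$

$$\alpha = K_\varepsilon^{-1} y = \tilde{y} - \overbrace{\mathcal{E}^{-1} F^\top C^{-1} \underbrace{F\tilde{y}}_{(i)}}^{(ii)}, \tag{18}$$

$$\operatorname{diag}(K_\varepsilon^{-1}) = \operatorname{diag}(\mathcal{E}^{-1}) - \underbrace{\operatorname{diag}(F^\top C^{-1} F)}_{(iv)} \odot \operatorname{diag}(\mathcal{E}^{-2}), \tag{19}$$

where $\odot$ is elementwise (Hadamard) vector multiplication.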
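
The identities above can be checked numerically on a small random problem. The following is a minimal NumPy sketch, not the oracle-based implementation: it builds the dense kernel matrix explicitly, which is feasible only for toy sizes, and compares it against the low-rank expressions. The sizes, the random data, and all variable names are assumptions made for illustration; F is taken to have one column per training instance, and the noise and scale matrices are taken to be diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 30                             # toy feature dimension and number of instances
F = rng.normal(size=(d, n))              # feature matrix, one column per instance
eps = rng.uniform(0.5, 2.0, size=n)      # diagonal of the noise matrix E
sigma = rng.uniform(0.5, 2.0, size=d)    # diagonal of the scale matrix Sigma
y = rng.normal(size=n)                   # training targets

# Dense reference, feasible only because n is tiny here.
K = np.diag(eps) + F.T @ (sigma[:, None] * F)
K_inv = np.linalg.inv(K)

# Low-rank quantities: only d x d linear systems are solved.
G = (F / eps) @ F.T                      # F E^{-1} F^T
C = np.diag(1.0 / sigma) + G             # C = Sigma^{-1} + F E^{-1} F^T
y_t = y / eps                            # y~ = E^{-1} y
Fy = F @ y_t                             # F y~

FKy = Fy - G @ np.linalg.solve(C, Fy)                    # identity (16)
FKF = G @ (np.eye(d) - np.linalg.solve(C, G))            # identity (17)
alpha = y_t - (F.T @ np.linalg.solve(C, Fy)) / eps       # identity (18)
diag_K_inv = 1.0 / eps - np.sum(F * np.linalg.solve(C, F), axis=0) / eps**2  # identity (19)

# Largest absolute deviation from the dense reference for each identity.
errors = {
    "FKy": np.max(np.abs(FKy - F @ K_inv @ y)),
    "FKF": np.max(np.abs(FKF - F @ K_inv @ F.T)),
    "alpha": np.max(np.abs(alpha - K_inv @ y)),
    "diag": np.max(np.abs(diag_K_inv - np.diag(K_inv))),
}
```

In this sketch every solve involves only the d x d matrix C, which is the reason the low-rank route stays cheap when the number of training instances is much larger than the feature dimension.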

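Using the above identities, we obtain the mean of the
predictive distribution,

$$m(x) = \ldots \tag{20}$$

and the marginal likelihood,

$$\ln p(y \mid \theta) = -\tfrac{1}{2} \Big( \underbrace{y^\top K_\varepsilon^{-1} y}_{(15)} + \underbrace{\ln|K_\varepsilon|}_{(14)} + n \ln(2\pi) \Big). \tag{21}$$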
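
As an illustration of how the log marginal likelihood can be evaluated without forming the full kernel matrix, the sketch below combines the determinant lemma and the quadratic-form identity and compares the result with a dense evaluation. It is a hedged example under assumptions, not the paper's implementation: the function names and toy data are hypothetical, F is assumed to hold one column per training instance, the noise and scale matrices are assumed diagonal, and n in the constant term is taken to be the number of training instances.

```python
import numpy as np

def log_marginal_likelihood_lowrank(F, y, eps, sigma):
    """Log marginal likelihood using only d x d matrices (d = feature dimension)."""
    n = y.shape[0]
    y_t = y / eps                                  # y~ = E^{-1} y
    Fy = F @ y_t                                   # F y~
    C = np.diag(1.0 / sigma) + (F / eps) @ F.T     # Sigma^{-1} + F E^{-1} F^T
    quad = y @ y_t - Fy @ np.linalg.solve(C, Fy)   # y^T K^{-1} y
    _, logdet_C = np.linalg.slogdet(C)
    logdet_K = np.log(eps).sum() + np.log(sigma).sum() + logdet_C  # ln|K|
    return -0.5 * (quad + logdet_K + n * np.log(2.0 * np.pi))

def log_marginal_likelihood_dense(F, y, eps, sigma):
    """Reference evaluation with the explicit n x n kernel matrix (toy sizes only)."""
    n = y.shape[0]
    K = np.diag(eps) + F.T @ (sigma[:, None] * F)
    _, logdet_K = np.linalg.slogdet(K)
    quad = y @ np.linalg.solve(K, y)
    return -0.5 * (quad + logdet_K + n * np.log(2.0 * np.pi))

# Hypothetical toy problem.
rng = np.random.default_rng(1)
F = rng.normal(size=(3, 20))
eps = rng.uniform(0.5, 2.0, size=20)
sigma = rng.uniform(0.5, 2.0, size=3)
y = rng.normal(size=20)

lml_lowrank = log_marginal_likelihood_lowrank(F, y, eps, sigma)
lml_dense = log_marginal_likelihood_dense(F, y, eps, sigma)
```

The two values should agree up to floating-point error; the low-rank version never allocates an n x n array.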

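Finally, we compute the gradient of the marginal likelihood
with respect to the noise variances $\varepsilon$,

$$\begin{aligned}
\nabla_\varepsilon \ln p(y \mid \theta) &= \operatorname{diag}\big((\alpha\alpha^\top - K_\varepsilon^{-1})\, \mathcal{E}'\big) \\
&= \big(\alpha \odot \alpha - \underbrace{\operatorname{diag}(K_\varepsilon^{-1})}_{(19)}\big) \odot \operatorname{diag}(\mathcal{E}'),
\end{aligned} \tag{22}$$

and with respect to the feature scales $\sigma$,

$$\begin{aligned}
\nabla_\sigma \ln p(y \mid \theta) &= \operatorname{diag}\big(F(\alpha\alpha^\top - K_\varepsilon^{-1})F^\top \Sigma'\big) \\
&= \big((F\alpha) \odot (F\alpha) - \operatorname{diag}(F K_\varepsilon^{-1} F^\top)\big) \odot \operatorname{diag}(\Sigma').
\end{aligned} \tag{23}$$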
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